Chapter 1: Introduction

Capital Bikeshare (also called CapBi) is a bicycle-sharing system that serves Washington DC, Arlington County, Alexandria, Falls Church, Montgomery County, Prince George’s County, and Fairfax County. The system is owned by the local governments and operated by Motivate International, Inc. (Motivate International, Inc). As of August 2019, Capital Bikeshare has 500 stations and 4,300 bicycles.

The distribution of the docks is shown below:

As we can see from the above image, the majority of the docks for the bicycle are in Washington DC.

Bike tours in Washington DC are a popular family activity, and renting a bike is a great way to get around without breaking the bank or sitting in traffic. Washington DC has dedicated bike lanes, which make riding safer and more convenient.

Capital Bikeshare is cheaper than its competitors, and its docks are conveniently placed near the monuments. It is often faster than other modes of transportation, and its annual membership offers unlimited trips under 30 minutes, which helps riders save money. CapBi can be used to commute to work or to ride to meet friends, and because the bikes are human-powered rather than electric, it is also a good form of exercise. CapBi saves fuel and avoids carbon emissions, so it is healthy for both the rider and the environment.

As CapBi services are very popular and always in demand, we want to predict the number of bikes that riders will use per hour, so that contingencies can be put in place to meet the demand. To estimate the number of bikes required, we consider factors such as weather conditions, temperature, working or non-working hours, and the hour of the day.

Fun Fact: CapBi offers GWU students an annual membership for only $25 (“Capital Bikeshare Discount”).

Chapter 2: Description of Data

2.1 Source of data

The data is sourced from the official Capital Bikeshare website, https://www.capitalbikeshare.com/system-data. We have downloaded data for September 2013 to September 2019.

The official data contains only the following variables:

Variable       Description
Duration       Duration of trip
Start Date     Includes start date and time
End Date       Includes end date and time
Start Station  Includes starting station name and number
End Station    Includes ending station name and number
Bike Number    Includes ID number of bike used for the trip
Member Type    Indicates whether user was a “registered” member (Annual Member, 30-Day Member or Day Key Member) or a “casual” rider (Single Trip, 24-Hour Pass, 3-Day Pass or 5-Day Pass)

We dropped the columns Duration, End Date, End Station, Bike Number and Member Type from the official Capital Bikeshare dataset, as they are not relevant to our analysis.

To predict the number of bikes used per hour, we scraped weather data from the following website: https://www.wunderground.com/history/daily/us/va/arlington-county/KDCA/.

To understand whether holidays increase or decrease bike usage, we also downloaded the holiday dataset from https://www.kaggle.com/gsnehaa21/federal-holidays-usa-19662020.

We merged all the different data sources into a single file and the structure for that file is as follows:

## 'data.frame':    6627646 obs. of  16 variables:
##  $ Start.date      : Factor w/ 52101 levels "2013-10-01 00:00:00",..: 2 9 10 11 12 13 14 15 16 17 ...
##  $ Start.station   : Factor w/ 658 levels "10th & E St NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Condition       : Factor w/ 47 levels "","Cloudy","Cloudy / Windy",..: 29 2 27 27 29 29 29 29 29 29 ...
##  $ Wind            : Factor w/ 19 levels "","CALM","E",..: 15 19 2 3 12 6 6 10 19 17 ...
##  $ Temperature..F. : num  63 64 66 67 71 79 81 83 84 84 ...
##  $ Dew.Point..F.   : num  55 56 56 57 58 58 58 58 55 55 ...
##  $ Humidity....    : num  75 75 70 70 63 48 45 42 37 37 ...
##  $ Wind.Speed..mph.: num  3 3 0 3 3 7 8 6 6 9 ...
##  $ Wind.Gust..mph. : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Pressure..in.   : num  30 30.1 30.1 30.1 30.1 ...
##  $ Precip...in.    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Holiday         : Factor w/ 10 levels "Birthday of Martin Luther King, Jr.",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ weekday         : Factor w/ 2 levels "Weekday","Weekend": 1 1 1 1 1 1 1 1 1 1 ...
##  $ timeOfDay       : Factor w/ 2 levels "Non Working Hour",..: 1 1 1 2 2 2 2 2 2 2 ...
##  $ season          : Factor w/ 4 levels "Fall","Spring",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ noOfBikes       : int  1 2 1 3 2 4 3 3 2 5 ...
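
The merge described above can be sketched as follows. This is a minimal illustration on toy data, not the report's actual code: the three small data frames, their values, and the helper column Date are assumptions made for the example. The idea is that the weather is joined on the hourly timestamp, while holidays are joined on the calendar date.

```r
# Toy stand-ins for the three sources; all rows and values are hypothetical.
trips <- data.frame(
  Start.date    = c("2013-10-01 08:00:00", "2013-10-01 08:00:00",
                    "2013-10-14 09:00:00"),
  Start.station = c("10th & E St NW", "Lincoln Memorial", "10th & E St NW"),
  noOfBikes     = c(2L, 5L, 1L),
  stringsAsFactors = FALSE
)
weather <- data.frame(
  Start.date      = c("2013-10-01 08:00:00", "2013-10-14 09:00:00"),
  Temperature..F. = c(63, 58),
  stringsAsFactors = FALSE
)
holidays <- data.frame(
  Date    = "2013-10-14",
  Holiday = "Columbus Day",
  stringsAsFactors = FALSE
)

# Weather is hourly, so join on the full timestamp (left join keeps all trips).
merged <- merge(trips, weather, by = "Start.date", all.x = TRUE)

# Holidays are daily, so join on the calendar date; non-holidays stay NA.
merged$Date <- substr(merged$Start.date, 1, 10)
merged <- merge(merged, holidays, by = "Date", all.x = TRUE)
merged$Date <- NULL

print(merged)
```

Using all.x = TRUE keeps every trip row even when no holiday matches, which is consistent with the NA values in the Holiday column of the structure shown above.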

2.2 Preprocessing of Data

Once the data is merged, we preprocess it. We drop the columns Start.station, Wind, Wind.Gust..mph., Pressure..in. and Precip...in., as they are not useful for our analysis.

The variable Condition has 47 unique levels:

##  [1]                              Cloudy                      
##  [3] Cloudy / Windy               Fair                        
##  [5] Fair / Windy                 Fog                         
##  [7] Haze                         Heavy Rain                  
##  [9] Heavy Rain / Windy           Heavy Snow                  
## [11] Heavy T-Storm                Heavy T-Storm / Windy       
## [13] Light Drizzle                Light Drizzle / Windy       
## [15] Light Freezing Drizzle       Light Freezing Rain         
## [17] Light Rain                   Light Rain / Windy          
## [19] Light Rain with Thunder      Light Sleet                 
## [21] Light Sleet / Windy          Light Snow                  
## [23] Light Snow / Windy           Light Snow and Sleet        
## [25] Light Snow and Sleet / Windy Mist                        
## [27] Mostly Cloudy                Mostly Cloudy / Windy       
## [29] Partly Cloudy                Partly Cloudy / Windy       
## [31] Patches of Fog               Rain                        
## [33] Rain / Windy                 Rain and Sleet              
## [35] Rain and Snow                Shallow Fog                 
## [37] Sleet                        Snow                        
## [39] Snow and Sleet               Squalls / Windy             
## [41] T-Storm                      T-Storm / Windy             
## [43] Thunder                      Thunder / Windy             
## [45] Thunder in the Vicinity      Wintry Mix                  
## [47] Wintry Mix / Windy          
## 47 Levels:  Cloudy Cloudy / Windy Fair Fair / Windy Fog ... Wintry Mix / Windy

We condense the Condition column from 47 levels into 6 levels.

If the condition is Cloudy, Cloudy / Windy, Mostly Cloudy, Mostly Cloudy / Windy, Partly Cloudy or Partly Cloudy / Windy, we replace it with Cloudy. Similar logic is used for the other weather conditions.

We finally have the following levels in the Condition column:

## [1] "Cloudy" "Fair"   "Fog"    "Rain"   "Snow"   "Windy"
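
As a minimal sketch of this kind of level collapsing, the snippet below handles only the Cloudy group, which is the one rule spelled out in the text. The helper name collapse_condition is hypothetical, and leaving every other level untouched is an assumption of the example; the rules for the remaining five groups are not given in the report.

```r
# Hypothetical helper: maps every level containing "Cloudy" to "Cloudy".
# Other levels are returned unchanged (assumption; their rules are not stated).
collapse_condition <- function(x) {
  x <- as.character(x)
  out <- x
  out[grepl("Cloudy", x, fixed = TRUE)] <- "Cloudy"
  out
}

raw <- c("Cloudy", "Cloudy / Windy", "Mostly Cloudy",
         "Partly Cloudy / Windy", "Fair")
collapsed <- collapse_condition(raw)
print(collapsed)
```

The same pattern could be repeated with one grepl() call per target group, with the order of the calls deciding which group wins for compound levels such as "Rain / Windy".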

The Holiday column has the following levels:

##  [1] <NA>                               
##  [2] Columbus Day                       
##  [3] Veterans Day                       
##  [4] Thanksgiving Day                   
##  [5] Christmas Day                      
##  [6] New Year's Day                     
##  [7] Birthday of Martin Luther King, Jr.
##  [8] Washington's Birthday              
##  [9] Memorial Day                       
## [10] Independence Day                   
## [11] Labor Day                          
## 10 Levels: Birthday of Martin Luther King, Jr. ... Washington's Birthday

We convert the Holiday column from a factor into a binary column, where 0 means no holiday and 1 means holiday.
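
One way to express this conversion in base R is shown below. It is a sketch on a toy vector and assumes, as in the structure printed earlier, that non-holidays are stored as NA.

```r
# Toy Holiday column: NA marks an ordinary day (as in the merged data).
holiday <- factor(c(NA, "Columbus Day", NA, "Labor Day"))

# 0 = no holiday, 1 = holiday, stored as a two-level factor.
holiday_flag <- factor(ifelse(is.na(holiday), 0, 1), levels = c(0, 1))
print(holiday_flag)
```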

Our dataset records the number of bikes used per hour at each station. To simplify this, we aggregate all the CapBi data across stations on an hourly basis.
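
The aggregation step can be illustrated with base R's aggregate(). The toy rows below are made up for the example, and only one weather column is carried along; in the real data the other hourly weather columns would be handled in the same way.

```r
# Toy station-level data: two stations report in the first hour.
station_hourly <- data.frame(
  Start.date      = c("2013-10-01 00:00:00", "2013-10-01 00:00:00",
                      "2013-10-01 01:00:00"),
  Start.station   = c("A", "B", "A"),
  Temperature..F. = c(62, 62, 63),
  noOfBikes       = c(1L, 2L, 3L)
)

# Sum bikes over stations within each hour; weather is constant within an hour,
# so it can be used as a grouping variable alongside the timestamp.
hourly <- aggregate(noOfBikes ~ Start.date + Temperature..F.,
                    data = station_hourly, FUN = sum)
names(hourly)[names(hourly) == "noOfBikes"] <- "Total_Bikes"
print(hourly)
```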

We also rename the columns for ease of use. We use the lubridate package to extract the hour, month, day and year from the Start_Date column, which is of type character. For example, for the date 2019-09-20 18:00:00, the hour is 18, the month is 09, the day is 20 and the year is 2019.
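
The worked example above can be reproduced without extra packages. The sketch below uses base R's POSIXlt fields instead of lubridate, purely to stay dependency-free; lubridate's hour(), month(), day() and year() return the same values for this timestamp. The UTC time zone is an assumption of the example.

```r
# Parse the character timestamp; the time zone is assumed for illustration.
ts <- as.POSIXlt("2019-09-20 18:00:00", tz = "UTC")

# POSIXlt stores months as 0-11 and years as offsets from 1900.
parts <- c(hour  = ts$hour,
           month = ts$mon + 1,
           day   = ts$mday,
           year  = ts$year + 1900)
print(parts)
```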

We convert the following columns to factors: HourOfDay, Month, Year, Day, Condition, Holiday, Weekday, TimeofDay and Season. The final processed dataframe looks as follows:

## Classes 'tbl_df', 'tbl' and 'data.frame':    52097 obs. of  15 variables:
##  $ Start_Date : Factor w/ 52101 levels "2013-10-01 00:00:00",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Condition  : Factor w/ 6 levels "Cloudy","Fair",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ Temp       : num  62 63 63 61 60 61 61 63 64 66 ...
##  $ Dew        : num  52 55 55 55 55 55 56 56 56 56 ...
##  $ Humidity   : num  70 75 75 81 83 81 83 78 75 70 ...
##  $ Windspeed  : num  3 3 5 6 5 5 5 3 3 0 ...
##  $ Holiday    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Weekday    : Factor w/ 2 levels "Weekday","Weekend": 1 1 1 1 1 1 1 1 1 1 ...
##  $ TimeofDay  : Factor w/ 2 levels "Non Working Hour",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Season     : Factor w/ 4 levels "Fall","Spring",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ HourOfDay  : Factor w/ 24 levels "0","1","2","3",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Month      : Factor w/ 12 levels "1","2","3","4",..: 10 10 10 10 10 10 10 10 10 10 ...
##  $ Day        : Factor w/ 31 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Year       : Factor w/ 7 levels "2013","2014",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Total_Bikes: int  38 41 23 5 4 21 108 396 1044 667 ...

The first 6 rows of the final processed dataset are:

Start_Date           Condition  Temp  Dew  Humidity  Windspeed  Holiday  Weekday  TimeofDay         Season  HourOfDay  Month  Day  Year  Total_Bikes
2013-10-01 00:00:00  Cloudy     62    52   70        3          0        Weekday  Non Working Hour  Fall    0          10     1    2013  38
2013-10-01 01:00:00  Cloudy     63    55   75        3          0        Weekday  Non Working Hour  Fall    1          10     1    2013  41
2013-10-01 02:00:00  Cloudy     63    55   75        5          0        Weekday  Non Working Hour  Fall    2          10     1    2013  23
2013-10-01 03:00:00  Fair       61    55   81        6          0        Weekday  Non Working Hour  Fall    3          10     1    2013  5
2013-10-01 04:00:00  Fair       60    55   83        5          0        Weekday  Non Working Hour  Fall    4          10     1    2013  4
2013-10-01 05:00:00  Cloudy     61    55   81        5          0        Weekday  Non Working Hour  Fall    5          10     1    2013  21

Chapter 4: Exploratory Data Analysis

4.1 SMART Question:

What is the total number of bikes rented at a given hour based on temperature and season?

4.2 Bike Demand by Season and Temperature

From the plots below we can see that winter is the least popular season for hiring bikes, while spring, summer, and fall show similar patterns.

The plot also includes temperature, and we observe that in each season more bikes are rented when temperatures are between 80 and 90 degrees Fahrenheit.

4.3 Bike Demand by Weather Conditions

The plot shows that people bike most in cloudy weather, followed by fair weather. Rain, snow, wind and similar conditions are not preferred.

4.4 Bike Demand by Year, Weekday & Hour of the Day

The number of bikes rented increases steadily up to 2017 and then decreases in 2018. Also, more bikes are hired on weekdays than on weekends.

Bike hires peak during the morning and evening rush hours, around 8 AM and 6 PM, when people are heading to or returning from work.

4.5 Bike Demand by Holiday

We notice that riders rent bikes more often on days that are not holidays, but the number of bikes rented on holidays is still significant.

4.6 Correlation Between Bikes Hired and Weather

Temperature has a positive correlation of 0.44 with the number of bikes hired, while Humidity has a negative correlation of -0.30.

The correlation table is as follows:
             Total_Bikes   Temp    Dew  Humidity  Windspeed  HourOfDay
Total_Bikes         1.00   0.44   0.24     -0.30       0.10       0.42
Temp                0.44   1.00   0.89      0.08      -0.07       0.14
Dew                 0.24   0.89   1.00      0.50      -0.18      -0.01
Humidity           -0.30   0.08   0.50      1.00      -0.29      -0.29
Windspeed           0.10  -0.07  -0.18     -0.29       1.00       0.15
HourOfDay           0.42   0.14  -0.01     -0.29       0.15       1.00
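
A table of this shape is what base R's cor() returns for a data frame of numeric columns. The sketch below runs it on simulated data, so the numbers it prints are not the report's; the column names are borrowed from the report only for readability.

```r
set.seed(1)

# Simulated stand-in data: bike counts rise with temperature plus noise.
temp <- rnorm(200, mean = 60, sd = 15)
toy <- data.frame(
  Total_Bikes = 5 * temp + rnorm(200, mean = 0, sd = 50),
  Temp        = temp,
  Humidity    = runif(200, min = 20, max = 100)
)

# Pearson correlation matrix, rounded to two decimals like the table above.
corr_table <- round(cor(toy), 2)
print(corr_table)
```

Note that cor() needs numeric inputs, so a column such as HourOfDay has to be numeric (not a factor) at the point where the correlation is computed.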

The correlation plot is as follows:

Chapter 5: Model Assessment

5.1 Splitting into Train and Test Data

For our analysis we scale all the numeric variables (Temp, Dew, Humidity and Windspeed) so that differences in their ranges do not skew the results.

We now split our dataset into training and test sets. Samples from the years 2013 to 2018 are used to train the model, and samples from 2019 are used to validate its performance.

The training set has 45557 samples and the test set has 6540 samples.
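
The scaling and year-based split can be sketched as below. The toy rows are invented, and the order of operations (scale first, then split) follows the description in the text rather than any code from the report.

```r
# Toy data: a handful of hours from different years (values are made up).
toy <- data.frame(
  Year        = c(2013, 2016, 2018, 2019, 2019),
  Temp        = c(62, 80, 45, 70, 90),
  Total_Bikes = c(38L, 400L, 120L, 300L, 500L)
)

# Standardise numeric predictors to mean 0 and standard deviation 1.
num_cols <- c("Temp")
toy[num_cols] <- lapply(toy[num_cols], function(v) as.numeric(scale(v)))

# Time-based split: earlier years for training, 2019 held out for testing.
train <- toy[toy$Year <= 2018, ]
test  <- toy[toy$Year == 2019, ]
print(c(train = nrow(train), test = nrow(test)))
```

Splitting by year rather than at random keeps the test set strictly in the future relative to the training data. A possible variation is to compute the scaling mean and standard deviation on the training years only and apply them to 2019.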

We also drop the columns Start_Date, Month, Day and Year from the training and test sets, as they are not used when creating the models.

As we need to predict the number of bikes, which is a numerical value, we perform regression analysis. We first create a Linear Regression model and try to optimize it. We then use a Decision Tree for regression and also try Bagged Decision Trees. Finally, we create a Random Forest model and tune it to get the best results.

5.2 Linear Regression Model

We fit a linear regression on the training set, using all the variables to predict the number of bikes per hour, which is our ‘y’ variable.

The summary of linear model is as follows:

## 
## Call:
## lm(formula = y ~ ., data = Training_Set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -804.17 -117.77  -13.27  103.45 1114.28 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            163.028      5.049  32.290  < 2e-16 ***
## ConditionFair           10.105      2.499   4.044 5.27e-05 ***
## ConditionFog             5.225     11.510   0.454 0.649893    
## ConditionRain         -132.889      3.903 -34.046  < 2e-16 ***
## ConditionSnow          -38.918     11.450  -3.399 0.000677 ***
## ConditionWindy          -1.231     14.789  -0.083 0.933687    
## Temp                   157.142      6.888  22.814  < 2e-16 ***
## Dew                    -17.833      7.864  -2.268 0.023355 *  
## Humidity               -39.063      3.597 -10.861  < 2e-16 ***
## Windspeed               -8.733      1.014  -8.615  < 2e-16 ***
## Holiday1               -70.324      5.780 -12.167  < 2e-16 ***
## WeekdayWeekend         -25.409      2.110 -12.043  < 2e-16 ***
## TimeofDayWorking Hour   -1.436      6.794  -0.211 0.832550    
## SeasonSpring             9.831      2.739   3.590 0.000331 ***
## SeasonSummer           -48.257      3.127 -15.433  < 2e-16 ***
## SeasonWinter           -27.934      3.220  -8.676  < 2e-16 ***
## HourOfDay1             -34.212      6.569  -5.208 1.92e-07 ***
## HourOfDay2             -48.474      6.573  -7.375 1.67e-13 ***
## HourOfDay3             -59.357      6.614  -8.975  < 2e-16 ***
## HourOfDay4             -64.500      6.639  -9.716  < 2e-16 ***
## HourOfDay5             -38.759      6.580  -5.891 3.87e-09 ***
## HourOfDay6              48.466      6.577   7.369 1.75e-13 ***
## HourOfDay7             249.983      6.577  38.006  < 2e-16 ***
## HourOfDay8             610.005      6.568  92.873  < 2e-16 ***
## HourOfDay9             538.675      7.086  76.022  < 2e-16 ***
## HourOfDay10            268.663      9.444  28.449  < 2e-16 ***
## HourOfDay11            250.066      9.460  26.433  < 2e-16 ***
## HourOfDay12            331.171      9.487  34.907  < 2e-16 ***
## HourOfDay13            343.751      9.518  36.115  < 2e-16 ***
## HourOfDay14            312.528      9.549  32.731  < 2e-16 ***
## HourOfDay15            317.677      9.568  33.201  < 2e-16 ***
## HourOfDay16            403.508      9.576  42.138  < 2e-16 ***
## HourOfDay17            661.644      9.565  69.170  < 2e-16 ***
## HourOfDay18            771.909      8.041  95.998  < 2e-16 ***
## HourOfDay19            514.892      6.642  77.520  < 2e-16 ***
## HourOfDay20            321.842      6.599  48.773  < 2e-16 ***
## HourOfDay21            199.693      6.572  30.387  < 2e-16 ***
## HourOfDay22            131.580      6.558  20.065  < 2e-16 ***
## HourOfDay23             57.800      6.552   8.822  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 202.2 on 45518 degrees of freedom
## Multiple R-squared:  0.6971, Adjusted R-squared:  0.6968 
## F-statistic:  2757 on 38 and 45518 DF,  p-value: < 2.2e-16

We notice that the R-squared value for the linear model with all the variables is 0.6971009.

We want to know how much each variable contributes to the R-squared value of the linear model, i.e. the relative importance of each variable. For this we use the calc.relimp function from the relaimpo package.

Variable   LMG
Condition  0.0256185
Season     0.0424971
HourOfDay  0.5769237
Temp       0.1340105
Dew        0.0643029
Humidity   0.0674104
Windspeed  0.0052728
Holiday    0.0017618
Weekday    0.0014998
TimeofDay  0.0807025

5.2.1 Feature selection on Linear Regression model

Having evaluated the linear model with all predictor variables, we now perform feature selection so that we can build a linear model on a subset of the variables without compromising accuracy.

The output for forward feature selection is as follows:

We observe that variables like Temp, TimeofDayWorkingHour, HourOfDay8, and HourOfDay18 are important.

The output for backward feature selection is as follows:

Similarly we observe that variables like Temp, Humidity, HourOfDay8, HourOfDay12, HourOfDay13 and HourOfDay18 are important.

5.3 Optimize the Multiple Linear Regression Model

First, we run the model with the four variables Temp, Dew, TimeofDay and HourOfDay, selected on the basis of the feature selection in section 5.2.1.

## 
## Call:
## lm(formula = y ~ Temp + Dew + TimeofDay + HourOfDay, data = Training_Set)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1023.34  -115.83    -7.61   102.98  1105.66 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            134.798      4.767  28.276  < 2e-16 ***
## Temp                   256.484      2.537 101.086  < 2e-16 ***
## Dew                   -133.618      2.464 -54.225  < 2e-16 ***
## TimeofDayWorking Hour   -2.563      6.965  -0.368    0.713    
## HourOfDay1             -31.786      6.736  -4.719 2.38e-06 ***
## HourOfDay2             -46.761      6.738  -6.940 3.98e-12 ***
## HourOfDay3             -56.292      6.778  -8.306  < 2e-16 ***
## HourOfDay4             -60.978      6.801  -8.967  < 2e-16 ***
## HourOfDay5             -35.352      6.739  -5.246 1.56e-07 ***
## HourOfDay6              51.203      6.734   7.604 2.94e-14 ***
## HourOfDay7             251.431      6.735  37.334  < 2e-16 ***
## HourOfDay8             610.836      6.729  90.777  < 2e-16 ***
## HourOfDay9             540.248      7.262  74.392  < 2e-16 ***
## HourOfDay10            267.754      9.678  27.666  < 2e-16 ***
## HourOfDay11            245.276      9.688  25.317  < 2e-16 ***
## HourOfDay12            322.804      9.704  33.264  < 2e-16 ***
## HourOfDay13            330.735      9.725  34.007  < 2e-16 ***
## HourOfDay14            296.593      9.745  30.435  < 2e-16 ***
## HourOfDay15            299.180      9.759  30.658  < 2e-16 ***
## HourOfDay16            383.777      9.762  39.312  < 2e-16 ***
## HourOfDay17            642.766      9.756  65.882  < 2e-16 ***
## HourOfDay18            754.525      8.195  92.072  < 2e-16 ***
## HourOfDay19            500.691      6.772  73.936  < 2e-16 ***
## HourOfDay20            313.747      6.746  46.510  < 2e-16 ***
## HourOfDay21            194.693      6.730  28.929  < 2e-16 ***
## HourOfDay22            129.808      6.722  19.310  < 2e-16 ***
## HourOfDay23             57.118      6.718   8.502  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 207.3 on 45530 degrees of freedom
## Multiple R-squared:  0.6813, Adjusted R-squared:  0.6811 
## F-statistic:  3744 on 26 and 45530 DF,  p-value: < 2.2e-16

Then we remove TimeofDay, as it has a high p-value and is thus not significant, and rerun the model with the three remaining variables.

The summary of the linear model using Temp, Dew and HourOfDay as predictors is as follows:

## 
## Call:
## lm(formula = y ~ Temp + Dew + HourOfDay, data = Training_Set)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1023.35  -115.83    -7.61   103.03  1105.65 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  134.798      4.767  28.277  < 2e-16 ***
## Temp         256.487      2.537 101.089  < 2e-16 ***
## Dew         -133.622      2.464 -54.228  < 2e-16 ***
## HourOfDay1   -31.785      6.736  -4.719 2.38e-06 ***
## HourOfDay2   -46.761      6.738  -6.940 3.98e-12 ***
## HourOfDay3   -56.292      6.777  -8.306  < 2e-16 ***
## HourOfDay4   -60.978      6.800  -8.967  < 2e-16 ***
## HourOfDay5   -35.352      6.739  -5.246 1.56e-07 ***
## HourOfDay6    51.203      6.734   7.604 2.93e-14 ***
## HourOfDay7   251.431      6.735  37.335  < 2e-16 ***
## HourOfDay8   610.836      6.729  90.778  < 2e-16 ***
## HourOfDay9   539.236      6.721  80.230  < 2e-16 ***
## HourOfDay10  265.191      6.719  39.467  < 2e-16 ***
## HourOfDay11  242.712      6.733  36.051  < 2e-16 ***
## HourOfDay12  320.240      6.755  47.409  < 2e-16 ***
## HourOfDay13  328.170      6.784  48.374  < 2e-16 ***
## HourOfDay14  294.028      6.811  43.167  < 2e-16 ***
## HourOfDay15  296.615      6.830  43.426  < 2e-16 ***
## HourOfDay16  381.211      6.836  55.767  < 2e-16 ***
## HourOfDay17  640.201      6.827  93.772  < 2e-16 ***
## HourOfDay18  752.841      6.799 110.723  < 2e-16 ***
## HourOfDay19  500.690      6.772  73.936  < 2e-16 ***
## HourOfDay20  313.746      6.746  46.511  < 2e-16 ***
## HourOfDay21  194.693      6.730  28.929  < 2e-16 ***
## HourOfDay22  129.808      6.722  19.310  < 2e-16 ***
## HourOfDay23   57.118      6.718   8.502  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 207.3 on 45531 degrees of freedom
## Multiple R-squared:  0.6813, Adjusted R-squared:  0.6811 
## F-statistic:  3894 on 25 and 45531 DF,  p-value: < 2.2e-16

Under each criterion, the preferred subset of variables is the one with the largest R2 or adjusted R2, or the lowest Cp or BIC. In our case, all four criteria selected the same four variables: temperature, dew point, time of day, and hour of day. So we built a new model with these four variables.

Furthermore, we find that the model can be condensed into three variables by removing TimeofDay and keeping only Temp, Dew and HourOfDay.

As the results below show, the reduced model includes only 3 variables but achieves an R2 close to that of the full model (0.6813 versus 0.6971). Therefore, we choose the reduced model over the full model and move on to the next phase of the optimization process.

Type of Model  Model Formula               R-squared  Adjusted R-squared
Reduced Model  y ~ Temp + Dew + HourOfDay  0.6813206  0.6811456
Full Model     y ~ .                       0.6971009  0.696848

Temperature and dew point have high VIF values, which indicates that the two are correlated and that multicollinearity is likely.

Variable   GVIF
Temp       6.740818
Dew        6.420717
HourOfDay  1.352372
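
To make the VIF idea concrete, the sketch below computes it from its definition, VIF_j = 1 / (1 - R2_j), where R2_j comes from regressing predictor j on the other predictors. The data are simulated and the helper vif_for is hypothetical; the report does not say which function produced its GVIF values, and HourOfDay is replaced here by a numeric stand-in. For terms with a single coefficient, GVIF coincides with the ordinary VIF.

```r
set.seed(42)
n <- 500

# Simulated predictors: dew is built to be strongly correlated with temp.
temp <- rnorm(n)
dew  <- 0.9 * temp + rnorm(n, sd = 0.45)
hour <- rnorm(n)  # numeric stand-in, unrelated to temp and dew

# Hypothetical helper: VIF of `target` given a data frame of other predictors.
vif_for <- function(target, others) {
  r2 <- summary(lm(target ~ ., data = others))$r.squared
  1 / (1 - r2)
}

vif_temp <- vif_for(temp, data.frame(dew, hour))
vif_dew  <- vif_for(dew,  data.frame(temp, hour))
vif_hour <- vif_for(hour, data.frame(temp, dew))
print(round(c(temp = vif_temp, dew = vif_dew, hour = vif_hour), 2))
```

In this simulation the two correlated predictors get clearly inflated values while the unrelated one stays near 1, which mirrors the pattern in the table above.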

Thus we create 2 more linear models. The first includes Temp but excludes Dew, and the summary for that model is as follows:

## 
## Call:
## lm(formula = y ~ Temp + HourOfDay, data = Training_Set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -969.35 -118.18   -6.35  106.83 1166.04 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  116.068      4.906  23.660  < 2e-16 ***
## Temp         130.069      1.033 125.872  < 2e-16 ***
## HourOfDay1   -37.007      6.950  -5.325 1.01e-07 ***
## HourOfDay2   -54.397      6.951  -7.826 5.14e-15 ***
## HourOfDay3   -67.623      6.990  -9.675  < 2e-16 ***
## HourOfDay4   -75.328      7.011 -10.744  < 2e-16 ***
## HourOfDay5   -52.041      6.946  -7.492 6.90e-14 ***
## HourOfDay6    33.173      6.940   4.780 1.76e-06 ***
## HourOfDay7   233.731      6.940  33.677  < 2e-16 ***
## HourOfDay8   597.804      6.938  86.160  < 2e-16 ***
## HourOfDay9   534.906      6.934  77.141  < 2e-16 ***
## HourOfDay10  273.178      6.931  39.413  < 2e-16 ***
## HourOfDay11  265.537      6.933  38.301  < 2e-16 ***
## HourOfDay12  356.177      6.936  51.353  < 2e-16 ***
## HourOfDay13  376.520      6.939  54.262  < 2e-16 ***
## HourOfDay14  351.069      6.944  50.560  < 2e-16 ***
## HourOfDay15  359.180      6.946  51.709  < 2e-16 ***
## HourOfDay16  446.037      6.944  64.231  < 2e-16 ***
## HourOfDay17  702.493      6.944 101.169  < 2e-16 ***
## HourOfDay18  807.372      6.938 116.366  < 2e-16 ***
## HourOfDay19  545.384      6.935  78.641  < 2e-16 ***
## HourOfDay20  346.049      6.933  49.914  < 2e-16 ***
## HourOfDay21  215.456      6.933  31.078  < 2e-16 ***
## HourOfDay22  142.440      6.932  20.549  < 2e-16 ***
## HourOfDay23   63.666      6.930   9.187  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 213.9 on 45532 degrees of freedom
## Multiple R-squared:  0.6607, Adjusted R-squared:  0.6606 
## F-statistic:  3695 on 24 and 45532 DF,  p-value: < 2.2e-16

The second linear model excludes Temp but includes Dew, and the summary for that model is as follows:

## 
## Call:
## lm(formula = y ~ Dew + HourOfDay, data = Training_Set)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1032.74  -114.45    -7.58   105.75  1212.24 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   97.224      5.259  18.488  < 2e-16 ***
## Dew           95.241      1.076  88.493  < 2e-16 ***
## HourOfDay1   -42.613      7.453  -5.718 1.09e-08 ***
## HourOfDay2   -62.680      7.454  -8.409  < 2e-16 ***
## HourOfDay3   -78.972      7.495 -10.536  < 2e-16 ***
## HourOfDay4   -89.695      7.518 -11.930  < 2e-16 ***
## HourOfDay5   -70.686      7.447  -9.492  < 2e-16 ***
## HourOfDay6    12.452      7.439   1.674   0.0942 .  
## HourOfDay7   213.357      7.440  28.676  < 2e-16 ***
## HourOfDay8   582.496      7.439  78.300  < 2e-16 ***
## HourOfDay9   529.141      7.436  71.157  < 2e-16 ***
## HourOfDay10  280.384      7.433  37.720  < 2e-16 ***
## HourOfDay11  287.880      7.433  38.728  < 2e-16 ***
## HourOfDay12  391.723      7.433  52.698  < 2e-16 ***
## HourOfDay13  424.402      7.432  57.101  < 2e-16 ***
## HourOfDay14  407.816      7.433  54.862  < 2e-16 ***
## HourOfDay15  421.478      7.433  56.700  < 2e-16 ***
## HourOfDay16  510.403      7.431  68.689  < 2e-16 ***
## HourOfDay17  764.208      7.432 102.833  < 2e-16 ***
## HourOfDay18  861.166      7.430 115.910  < 2e-16 ***
## HourOfDay19  589.085      7.431  79.278  < 2e-16 ***
## HourOfDay20  377.687      7.431  50.823  < 2e-16 ***
## HourOfDay21  235.947      7.433  31.742  < 2e-16 ***
## HourOfDay22  154.825      7.433  20.829  < 2e-16 ***
## HourOfDay23   69.810      7.432   9.393  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 229.4 on 45532 degrees of freedom
## Multiple R-squared:  0.6098, Adjusted R-squared:  0.6096 
## F-statistic:  2965 on 24 and 45532 DF,  p-value: < 2.2e-16

The first linear model, i.e. the one with temperature, has the higher R2 value, so the final linear model consists of Temp and HourOfDay. Hence, using 2 variables we can explain 66.07% of the variation in our training dataset.

5.3.1 Relative importance

We want to know the relative importance of each variable in our final linear model, which consists of only 2 variables, Temp and HourOfDay. Using the relaimpo package, we conclude that HourOfDay makes the largest contribution to predicting bike usage.

Variable   LMG
HourOfDay  0.760272
Temp       0.239728
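
For two predictors, the LMG measure has a simple closed form: each variable's share is the average of its R2 contribution when entered first and when entered last. The sketch below works this out by hand on simulated data with numeric stand-ins (the variable names and effect sizes are assumptions), and normalises the shares to sum to 1 as in the table above. It does not call relaimpo.

```r
set.seed(7)
n <- 300

# Simulated data: a strong "hour" effect and a weaker "temp" effect.
hour_effect <- rnorm(n)
temp        <- rnorm(n)
y           <- 3 * hour_effect + 1 * temp + rnorm(n)
d <- data.frame(y, hour_effect, temp)

r2 <- function(f) summary(lm(f, data = d))$r.squared
r2_full <- r2(y ~ hour_effect + temp)

# Average the R2 gain over the two possible orders of entry.
lmg_hour <- 0.5 * (r2(y ~ hour_effect) + (r2_full - r2(y ~ temp)))
lmg_temp <- 0.5 * (r2(y ~ temp) + (r2_full - r2(y ~ hour_effect)))

# Normalise so the shares sum to 1.
shares <- c(hour_effect = lmg_hour, temp = lmg_temp) / r2_full
print(round(shares, 3))
```

Because the two unnormalised values always add up to the full-model R2, the normalised shares add up to exactly 1.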

The plot of the relative importance is shown below:

5.4 Decision Tree

After performing regression with a linear model, we now perform regression using Decision Trees. Decision Trees have the advantage that they are simple to create and can capture non-linear relationships.

To save computational time, we have stored the Decision Tree model as an RDS file and read it back for our predictions.

We created the decision tree using the caret package. The “rpart2” method of caret’s train function creates a decision tree and tunes it on the maximum depth of the tree. By using the argument tuneLength=9, we evaluate nine candidate depths, up to a maximum depth of 10. We also performed 10-fold cross-validation, repeated 3 times.
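
The tuning idea, choosing the maximum depth that gives the lowest cross-validated RMSE, can be sketched without caret. The example below calls rpart directly inside a manual 5-fold loop on simulated data; the data, the candidate depths, the number of folds and the choice cp = 0 (so that depth is the only constraint being varied) are all assumptions of the sketch, not settings taken from the report.

```r
library(rpart)
set.seed(123)
n <- 600

# Simulated hourly data with rush-hour peaks and a temperature effect.
toy <- data.frame(hour = sample(0:23, n, replace = TRUE),
                  temp = runif(n, min = 20, max = 95))
toy$bikes <- 200 + 400 * (toy$hour %in% c(8, 17, 18)) +
  3 * toy$temp + rnorm(n, sd = 40)

# Assign each row to one of 5 folds at random.
folds  <- sample(rep(1:5, length.out = n))
depths <- c(1, 2, 4, 6)

# Mean held-out RMSE across folds for each candidate depth.
cv_rmse <- sapply(depths, function(d) {
  fold_rmse <- sapply(1:5, function(k) {
    fit <- rpart(bikes ~ hour + temp, data = toy[folds != k, ],
                 control = rpart.control(maxdepth = d, cp = 0))
    pred <- predict(fit, newdata = toy[folds == k, ])
    sqrt(mean((toy$bikes[folds == k] - pred)^2))
  })
  mean(fold_rmse)
})

best_depth <- depths[which.min(cv_rmse)]
print(data.frame(maxdepth = depths, RMSE = round(cv_rmse, 1)))
print(best_depth)
```

caret automates the same loop, adds the repeats, and refits the final model on the full training set with the selected depth.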

The decision tree is as follows:

## CART 
## 
## 45557 samples
##    10 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 41001, 41002, 41001, 41000, 41002, 41001, ... 
## Resampling results across tuning parameters:
## 
##   maxdepth  RMSE      Rsquared   MAE     
##    1        337.1613  0.1568869  263.9271
##    2        319.9661  0.2407091  249.8055
##    4        292.9740  0.3633594  225.0676
##    5        281.2208  0.4134242  217.4486
##    6        269.7184  0.4604437  206.6960
##    7        260.2038  0.4978568  200.7196
##    8        251.3514  0.5314469  193.0507
##    9        241.6853  0.5667628  183.8102
##   10        206.6387  0.6833124  148.6568
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was maxdepth = 10.

As we see from the above summary, the final depth used for the decision tree is 10. The plot of RMSE against the maximum depth of the decision tree is as follows:

As the depth of the tree increases, performance improves, i.e. the RMSE of the model decreases.

The Decision Tree which we will use for prediction is as follows:

The importance of the variables for the decision tree is shown below:

We notice that the variables Temp, WeekdayWeekend, Dew, SeasonWinter and HourOfDay18 have high variable importance. The importance of a variable depends on how high it appears in the decision tree and on the number of times it is used (Therneau, 2019, p. 11).

We have trained our Decision Tree on the data for 2013 to 2018. We now perform predictions on the 2019 data.

The evaluation metrics of the Decision Tree are as follows:

R2         RMSE      MAE
0.7319768  194.6294  139.3643
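
The three metrics can be computed with a small helper such as the one below. The function name eval_metrics and the toy vectors are hypothetical. R2 is taken here as the squared correlation between observed and predicted values, which is an assumption; the report does not state which definition it used.

```r
# Hypothetical helper returning R2, RMSE and MAE for a set of predictions.
eval_metrics <- function(actual, predicted) {
  c(R2   = cor(actual, predicted)^2,            # squared correlation (assumed)
    RMSE = sqrt(mean((actual - predicted)^2)),  # root mean squared error
    MAE  = mean(abs(actual - predicted)))       # mean absolute error
}

# Toy example: errors are -2, 2, -3 and 3.
m <- eval_metrics(actual = c(10, 20, 30, 40), predicted = c(12, 18, 33, 37))
print(m)
```

RMSE penalises large errors more heavily than MAE, which is why the RMSE values in this report are consistently above the corresponding MAE values.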

5.4.1 Bagged Decision Tree

Decision Trees have a drawback: a single tree is sensitive to the particular sample it was trained on, so it can perform poorly on a new sample of data. Thus we use Bagging, which combines Bootstrapping the data with Aggregation of the results.

Bagging creates an ensemble of decision trees. Bagged Decision Trees overcome the drawback of a single tree by basing the estimate on a majority vote (for classification) or an average (for regression) across the ensemble.
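
The two ingredients of bagging, bootstrap resampling and averaging, can be sketched directly with rpart. The simulated data, the number of trees and the train/test cut below are assumptions made for illustration; this is not the caret "treebag" implementation.

```r
library(rpart)
set.seed(2019)
n <- 400

# Simulated data: bikes increase with temperature, plus noise.
toy <- data.frame(temp = runif(n, min = 20, max = 95))
toy$bikes <- 100 + 5 * toy$temp + rnorm(n, sd = 60)
train <- toy[1:300, ]
test  <- toy[301:400, ]

# Bootstrap: fit one tree per resample of the training rows (with replacement).
n_trees <- 25
preds <- sapply(seq_len(n_trees), function(b) {
  boot <- train[sample(nrow(train), replace = TRUE), ]
  predict(rpart(bikes ~ temp, data = boot), newdata = test)
})

# Aggregate: average the per-tree predictions for each test row.
bagged_pred <- rowMeans(preds)

rmse <- function(p) sqrt(mean((test$bikes - p)^2))
tree_rmse <- apply(preds, 2, rmse)
bag_rmse  <- rmse(bagged_pred)
print(c(mean_single_tree = mean(tree_rmse), bagged = bag_rmse))
```

The RMSE of the averaged prediction can never exceed the mean RMSE of the individual trees, which is the variance-reduction effect that bagging relies on.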

To save computational time, we have stored the Bagged Decision Tree model as an RDS file and read it back for our predictions.

We created the bagged decision tree using the caret package. The “treebag” method of caret’s train function creates a bagged decision tree; it has no tuning parameter. We performed 10-fold cross-validation, repeated 3 times.

The bagged decision tree is as follows:

## Bagged CART 
## 
## 45557 samples
##    10 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 41002, 41001, 41001, 41001, 41002, 41002, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   199.3304  0.7060689  144.2645

The importance of the variables for the bagged decision tree is shown below:

We notice that the variables Temp, WeekdayWeekend, Dew, HourOfDay18 and SeasonWinter have high variable importance.

We have trained our Bagged Decision Tree on the data for 2013 to 2018. We now perform predictions on the 2019 data.

The evaluation metrics of the Bagged Decision Tree are as follows:

R2         RMSE      MAE
0.7447782  190.1936  136.4659

Thus, using the bagged decision tree we get a slightly higher R-squared value than with the single decision tree (0.7448 versus 0.7320).

5.6 Random Forest

Finally, we model using Random Forest.

In the bagged tree model, different bootstrap samples of the dataset are drawn and a tree is trained on each one. Because every variable is available to every tree, there is a chance that the model overfits, and also a chance that the most significant variable in the dataset is suppressed by the other variables.

Let's try a limited number of variables and check the performance. We have 10 variables; do we need to try every option? Let's use what we learned from the previous models, such as linear regression and the decision tree: more than 80% of the variable importance comes from the top three variables. So let's start our hunt for a three-variable model.

Two variables were found to be predominant in the previous models, but the third is hard to choose. Let's try different combinations of three variables and check the model performance.

We used the randomForest package to train the model. Its mtry parameter is the number of variables randomly sampled as candidates at each split; for a regression model it defaults to p/3 (rounded down), where p is the number of predictors.
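To make the role of mtry concrete, here is a small Python sketch. The predictor names are the ten variables used in this report; the helper names `default_mtry_regression` and `candidate_variables` are hypothetical, and the default rule shown (one third of the predictors, rounded down, with a minimum of 1) is the usual regression default for random forests.

```python
import random

PREDICTORS = ["Condition", "Temp", "Dew", "Humidity", "Windspeed",
              "Holiday", "Weekday", "TimeofDay", "Season", "HourOfDay"]

def default_mtry_regression(n_predictors):
    """Usual default for regression forests: floor(p / 3), but at least 1."""
    return max(n_predictors // 3, 1)

def candidate_variables(predictors, mtry, rng):
    """Pick the random subset of variables considered at one split."""
    return rng.sample(predictors, mtry)

mtry = default_mtry_regression(len(PREDICTORS))
rng = random.Random(0)
print(mtry)
# A fresh subset is drawn at every split, so different splits see different variables.
for _ in range(3):
    print(candidate_variables(PREDICTORS, mtry, rng))
```

With 10 predictors the default is therefore 3 candidate variables per split, which is what forces the trees to differ from one another more than in plain bagging.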

To save computation time, we trained the model in advance and load it from file to check its performance.

Let's use our model to predict on the test data.

The evaluation metrics of the random forest are as follows:

R2 RMSE MAE
0.9321908 97.85963 62.4085

This model gave us the best R2 value so far. Let's check the importance of each variable to identify the top three variables.

##             %IncMSE IncNodePurity
## Condition 106.42494     163639162
## Temp       79.90901     862235383
## Dew        35.38582     295842195
## Humidity   62.20948     349574467
## Windspeed  50.81140     109738801
## Holiday   137.36333      53374646
## Weekday   450.50367     456264976
## TimeofDay  37.94861     417106486
## Season     41.75134     212546356
## HourOfDay 165.53598    3019676627

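Two importance measures, %IncMSE and IncNodePurity, are calculated on the trained model.

%IncMSE - the regression counterpart of Mean Decrease Accuracy. It measures how much the mean squared error on the out-of-bag samples increases when the values of that variable are randomly permuted.

IncNodePurity - the regression counterpart of Mean Decrease Gini. It is the total decrease in node impurity from the splits on that variable, averaged over all trees. For regression, impurity is measured by the residual sum of squares rather than the Gini index.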
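The permutation idea behind %IncMSE can be sketched in Python as follows. This is a simplified illustration, not the randomForest computation: it permutes a column once on a given dataset and reports the raw increase in mean squared error, whereas the package works on the out-of-bag samples of each tree and normalises the result. The function names and the toy model are hypothetical.

```python
import random

def mse(predict, rows, targets):
    """Mean squared error of a prediction function over a dataset."""
    return sum((predict(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)

def permutation_importance(predict, rows, targets, column, seed=0):
    """Increase in MSE after shuffling one column, leaving the others intact."""
    rng = random.Random(seed)
    shuffled = [r[column] for r in rows]
    rng.shuffle(shuffled)
    permuted = [r[:column] + [v] + r[column + 1:] for r, v in zip(rows, shuffled)]
    return mse(predict, permuted, targets) - mse(predict, rows, targets)

# Toy data: the target depends only on column 0; column 1 is irrelevant.
rows = [[float(i), float(i % 3)] for i in range(50)]
targets = [2.0 * r[0] for r in rows]
predict = lambda r: 2.0 * r[0]  # stands in for a trained model

print(permutation_importance(predict, rows, targets, column=0))  # large increase
print(permutation_importance(predict, rows, targets, column=1))  # no increase
```

A variable the model relies on produces a large increase in error when shuffled, while a variable the model ignores produces none.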

We can see that HourOfDay, Weekday, Holiday, Condition and Temp are the most important variables by %IncMSE, similar to the relative importance seen in the other models.

Hang on! Let’s try different numbers of variables, from 1 to 5, with 5-fold cross-validation, and repeat the process 3 times to get an averaged result.

5.6.1 Tuning/Optimizing Random Forest:

We tune/optimize Random Forest by using grid search and setting mtry in the range of 1 to 5.

Training the model takes quite some time, so we trained it in advance, stored it, and read it directly from disk for faster execution and knitting of the file.

We tried mtry values from 1 to 5, and we can see that the R2 value increases as the number of variables increases.

## Random Forest 
## 
## 45557 samples
##    10 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 36445, 36445, 36447, 36446, 36445, 36445, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE     
##   1     298.1751  0.6622144  234.8145
##   2     228.6833  0.7355588  175.3999
##   3     188.9903  0.7937174  140.6115
##   4     161.8658  0.8376181  116.6697
##   5     142.4377  0.8669464   99.4204
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 5.
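The selection rule stated at the end of the output above, picking the mtry with the smallest cross-validated RMSE, can be written out as a short Python sketch. The numbers are copied from the resampling results above; the names `cv_results` and `select_best` are hypothetical.

```python
# Cross-validated results per mtry value, taken from the output above.
cv_results = {
    1: {"RMSE": 298.1751, "Rsquared": 0.6622144, "MAE": 234.8145},
    2: {"RMSE": 228.6833, "Rsquared": 0.7355588, "MAE": 175.3999},
    3: {"RMSE": 188.9903, "Rsquared": 0.7937174, "MAE": 140.6115},
    4: {"RMSE": 161.8658, "Rsquared": 0.8376181, "MAE": 116.6697},
    5: {"RMSE": 142.4377, "Rsquared": 0.8669464, "MAE": 99.4204},
}

def select_best(results, metric="RMSE"):
    """Return the tuning value with the smallest value of the chosen metric."""
    return min(results, key=lambda mtry: results[mtry][metric])

best_mtry = select_best(cv_results)
print(best_mtry, cv_results[best_mtry])
```

Since the smallest RMSE falls on the largest value in the grid, a wider grid might be worth trying, although that was not part of this analysis.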

The plot shows that the cross-validated RMSE decreases as the number of predictors (mtry) increases, which is not surprising.

Let’s use this tuned model, which tried several values of mtry, to predict the test data.

R2 RMSE MAE
0.8875139 133.6627 92.47352

With a maximum of 5 variables, we were able to achieve an R2 of 0.8875 on the test data. Let's check which variables contribute most to this R2 value.

As most of the variables are categorical, each level is treated as a separate predictor. We can see that HourOfDay, Weekday, Condition and Temp are the variables that contribute most to our model.

HourOfDay18 (6 PM) is the best predictor. Whether or not the day falls on a weekend, and the time of day combined with whether it is a working day, are also good indicators of the total number of bikes used in a given hour.

5.8 Model Comparison

| Model | No. of Predictors | R2 [Test Data] | Variable Importance (desc) |
|---|---|---|---|
| Linear Regression | Full model, i.e. all variables | 0.69 | HourOfDay, Temp, TimeOfDay |
| Linear Regression | 2 | 0.67 | HourOfDay, Temp |
| Decision Tree | Full model, i.e. all variables | 0.73 | Temp, Weekday, Dew, Season, HourOfDay |
| Bagged Tree | Full model, i.e. all variables | 0.74 | Temp, Weekday, Dew, HourOfDay, Season |
| Random Forest | 3 | 0.93 | HourOfDay, Weekday, Holiday, Condition, Temp |
| Random Forest - Grid Search | Combination of 1 to 5 variables | 0.88 | HourOfDay, Weekday, Holiday, Temp |

Thus the random forest performs best, with the highest R2 value on the test data at 0.93.

Chapter 6: Conclusion

In terms of modelling and predictions, we can conclude that:

  • Random Forest works best with the given dataset

  • Maximum R2 value obtained is 0.93

  • The most important variables are as follows:

    • Hour of Day – Best: 6–7 PM and 8–9 AM

    • Weekday – Weekend

    • Time of Day – Working Hour

    • Temp – Moderate Temperature – 70–90F

The insight from our analysis is that on a normal working day, users tend to ride a bike to commute to offices, schools, etc. On weekends and holidays, people prefer to use bikes for travel and leisure activities. We also find that bikes are used most in moderate temperatures and that users tend to avoid them at high and low temperatures.

Based on our analysis, we recommend that Capital Bikeshare increase bike availability during the high-demand periods: morning and evening office hours, weekends and holidays. This would cater to more users and, in turn, secure more profit.

Chapter 7: Bibliography

Motivate International, Inc. (n.d.). Press Kit. Retrieved November 26, 2019, from https://www.capitalbikeshare.com/press-kit.

Capital Bikeshare Discount. (n.d.). Retrieved November 26, 2019, from https://benefits.gwu.edu/capital-bikeshare-discount.

Therneau, T. M. (2019, April 11). An Introduction to Recursive Partitioning Using the RPART Routines. Retrieved November 26, 2019, from https://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf.